Measure Concentration of Hidden Markov Processes
Leonid Kontorovich
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
USA
llkontor@cs.cmu.edul
February 2, 2008
Abstract
We prove what appears to be the first concentration of measure result for hidden Markov
processes. Our bound is stated in terms of the contraction coefficients of the underlying Markov
process, and strictly generalizes the Markov process concentration results of Marton (1996) and
Samson (2000). Somewhat surprisingly, the hidden Markov process is at least as "concentrated"
as its underlying Markov process; this property, however, fails for general hidden/observed pro-
cess pairs.
1 Introduction
Recently several general techniques have been developed for proving concentration results for
nonproduct measures ^0J|Hj (see the references cited in for a brief overview). Let (S,J-)
be a Borel- measurable space, and consider the probability space (S n , T n , fx) with the associated
random process (Xi)x<i< n , Xi G S. Suppose further that S n is equipped with a metric d. For
our purposes, a concentration of measure result is an inequality stating that for any 1-Lipschitz
(with respect to d) function / : S n — + R, we have
F{\f(X)-Ef(X)\>t} < 2ex P (-Kt 2 ) (1)
where K may depend on n but not on f. 1 The quantity fjij, defined below, has proved useful
for obtaining concentration results. For 1 < i < j < n, y G iS' _1 and w G S, let
C{X^\X[- l =y,X i = w)
be the law of X™ conditioned on X{~ x — y and Xi — w. Define
Vl3 (y,w,w') = \\L{X?\Xi- x =y,X i = w)-C{X?\X i 1 - 1 =y,X i = w')\\ TV
and
fjij = sup sup r)ij(y,w,w') (2)
where ||-|| TV is the total variation norm (see H2.3l to clarify notation).
1 See for a much more general notion of concentration.
1
Let r and A be upper-triangular n x n matrices, with Ta = An = 1 and
= \/fjij , Ay = ijij
for 1 < i < j < n.
For the case where S = [0, 1] and d is the Euclidean metric on K n , Samson [H] showed that
if / : [0, 1]" — > K is convex and Lipschitz with ||/|| Li < 1, then
P{|/(X)-E/(X)|>t} < 2cxp^-^^ (3)
where ||r|| 2 is the £2 operator norm of the matrix T; Marton [o] has a comparable result.
For the case where S is countable and d is the (normalized) Hamming metric on «S n ,
1 "
i=l
Kontorovich and Ramanan jjj showed that if / : S n — > R is Lipschitz with ||/|| Li < 1, then
nt 2
P{|/(X)-E/(X)|>t} < 2exp^-^^j (4)
where ||A|| is the operator norm of the matrix A, also given by
ll A lloo = max (1 + fj iti+1 + . . . + fj it n). (5)
l<t<n
This is a strengthening of the Markov measure concentration result in Marton 0] .
These two results provide ample motivation for bounding fjij as a means of obtaining a
concentration result for a process. For Markov processes, Samson gives bounds on ||r|| 2 , while
Kontorovich and Ramanan bound HA^.
In this paper, we extend the technique in £Q to the case of hidden Markov processes. If
(Xi)±<i< n is a hidden Markov process whose underlying Markov process has contraction coeffi-
cients (#i)i<i< n , we will show that
fjij < 0i6 i+ f-6j-i. (6)
To our knowledge, this is the first concentration result for hidden Markov processes. In light of
the discussion in ^ the form of the bound - identical to the one in pQ and [H] for the simple
Markov case - should be at least somewhat surprising. Our result may be summarized by the
statement that a hidden Markov process is at least as concentrated as its underlying Markov
process.
2 Bounding rjij for hidden Markov processes
2.1 Definition of hidden Markov process
Consider two countable sets, S (the "hidden state" space) and S (the "observed state" space),
equipped with cr-algebras T = 2 s and T — 2 s , respectively. Let (S n ,J rn ,iJ,) be a probability
space, where fi is a Markov measure with transition kernels pi(- \ ■). Thus for x 6 S n , we have
n-l
H(x) = po{xi) Y[ Pk(£k+i I Xfc).
k=l
2
Suppose (S n x S 71 ,^ 71 x v) is a probability space whose measure v is defined by
n
v(x,x) = fj,(x)Y[qi(x e \xi), (7)
1=1
where qi(- \ x) is a probability measure on (S, T) for each i eS and 1 < £ < n. On this product
space we define the random process (Xi, Xi)i<a< n , which is clearly Markov since
F{(X i+1 ,X i+1 ) = (x,x)\(XiXi) = (y,y)} = Pi (x\yi)q i+l (x\x)
= p{(X i+1 ,X i+1 ) = (x,x) | (X^Xi) = (fe, W )} .
The (marginal) projection of (Xi,Xi) onto Xi results in a random process on the probability
space (S n ,T n ,p), where
p(x) =P{X = x}= x). (8)
The random process (Xj)i<i<„ (or measure p) on (S n 1 F n ) is called a hidden Markov process
(resp., measure); it is well known that (Xi) need not be Markov to any order 2 . We will refer to
(Xi) as the underlying process; it is Markov by construction.
2.2 Statement of result
Theorem 2.1. Let (Xi)i<i< n be a hidden Markov process, whose underlying process (Xi)i<i< n
is defined by the transition kernels pi(- \ ■). Define the k th (Doeblin) contraction coefficient
k by
9 k = sup \\p k (- 1 x) -p k (- 1 ^')ll T v •
Then for the hidden Markov process X , we have
Vij < • • •
for 1 < i < j i < n.
Remark 2.2. Modulo measurability issues, a hidden Markov process may be defined on continu-
ous hidden and observed state spaces; the definition of fjij is unchanged (we may weaken the sup
in J5J to ess sup; see [2j). For convenience, the proof of Theorem 12. II is given for the countable
case, but can straightforwardly be extended to the continuous one.
The bounds in J3J and are for different metric spaces and therefore not readily comparable
(the result in J3J has the additional convexity assumption; see [2] for a discussion). In the
special case where the underlying Markov process is contracting, i.e., di < < 1 for 1 < i < n,
Theorem 12.11 yields
fjij < V
In this case, Samson gives the bound
lirila < y^r,
1 — &2
2 One can easily construct a hidden Markov process over S = {0, 1,2} and S = {a, b} where, with probability 1,
consecutive runs of b will have even length. Such a process cannot be Markov.
3
and the bound
oo 1
H A IL < Y. 6k = —e
k=0
holds trivially via JHJ-
2.3 Notational conventions
Since the calculation is notationally intensive, we emphasize readability, sometimes at the slight
expense of formalistic precision.
The probability spaces in the proof are those defined in tl2.ll We will consistently distinguish
between hidden and observed state sequences, indicating the former with a Random variables
are capitalized (X), specified state sequences are written in lowercase (x), the shorthand Xj =
Xi . . . Xj is used for all sequences, and brackets denote sequence concatenation: [xj Xj +1 ) = x*.
Sums will range over the entire space of the summation variable; thus f{ x {) stands for
f{x{) with an analogous convention for f{x{)-
xiesi-^ 1 x{
The total variation norm ||-|| TV is defined here, for any signed measure r on a countable
set X, by
xex
The probability operator P {•} is defined with respect to (S n , T n 1 p) whose measure p is given
in JSJ). Lastly, we use the shorthand
H(u e k ) = p Q (uk) 1{k = 1} Y[pt(ut+i\ut)
t=k
K4i"fc) = n it(u t \ut)
t=k
p{u{) = F{Xi=4}.
2.4 Proof of main result
The proof of Theorem 12.11 is elementary - it basically amounts to careful bookkeeping of sum-
mation indices, rearrangement of sums, and probabilities marginalizing to 1. At the core is a
basic contraction result for Markov operators, which we quote as Lemma B.l of PQ, though it
has been known for quite some time (see references cited ibid.):
Lemma 2.3. For a countable set X, let u 6 M. x be such that Ylxex u ^ = 0> an( ^ A e ^■ XxX be
a column- stochastic matrix: A xy > for x, y € X and ^2 x€X A xy = 1 for all y £ X . Then
\\Au\\ TV < e A \\u
where 6a is the (Doeblin) contraction coefficient of A:
0a = \ sup Y] \A xy - A
4
Proof of Theorem \2.1\ For 1 < i < j < n, y\ 1 G S l 1 and w£ G 5, we use © to expand 3
mi (y\-\ wtM) = I E l p = I = tyi" 1 ^1 } - p i x ? = x l I x i = ivt 1 <] } I
,3-1
i+1
>{x& 1 = [z£}a$]\Xi = hi- 1 w' i ]})
hJ2
^E
EEm j "
z i+l
E E E E M Ci im*? 1 ^7>(4+i i CX^r 1 1 sr 1 )^)
where
I&)
Since a «^j| — Si a i|X)j f° r a « — and bi G K, we may bound
where
= IE^")K(%)I'
(10)
(11)
4+1 *&i si
si
Define the vector h G R 5 by
h, = ^E^r^Hyr 1 !^" 1 )-
(12)
Si
Then
C(%) = EE ^ l ^]) h '
Si •
z i+i H
Define the matrix G R^ x ^ by = pfc(w | v), for 1 < k < n. With this notation, we
have ((xj) — z^ j: where z 6 I 5 is given by
z = A^A^-V ■■■A^A^h. (13)
! Note that all the sums are absolutely convergent, so exchanging the order of summation is justified.
5
In order to apply Lemma fe.ijl to i|13|) . we must verify that
X> = 0, ||h|| TV <l. (14)
ves
From (|12|l we have
Vl
Summing over v, we get
E(^i))EM[a-'*])^-'ir') - PW= ;,-, roJ} E>.(«)Kb:-^]iffl
= l;
an analogous identity holds for the ttt=i ^vL term, which proves 1141) .
6 J p([y\ <]) ' — '
Therefore, combining l|ll|). 113|) . and Lemma T2. 31 we have
J
= §EKIE«)
< OiOi+i ■ ■ ■ Oj-i-
□
3 Discussion
The relative ease with which we were able to bound fjij is encouraging; it suggests that the
technique used in pp and here - namely, matrix algebra combined with the Markov contraction
lemma - could be applicable to other processes.
We noted in fJTJthat the bound for the hidden Markov process is identical to the one in £Q
and [H] for the simple Markov case. One might thus be tempted to pronounce Theorem 12. II as
"obvious" in retrospect, based on the intuition that the observed sequence Xj is an independent
process conditioned the hidden sequence X%. Thus, the reasoning might go, all the dependence
structure is contained in Xj, and it is not surprising that the underlying process alone suffices
to bound fjij - which, after all, is a measure of the dependence in the process.
Such an intuition, however, would be wrong, as it fails to carry over to the case where the
underlying process is not Markov. As a numerical example, take n = 4, S = S = {0, 1} and
define the probability measure \i on <S as given in Figure ^ Define the conditional probability
q(x | x) = jl {x=S; } + |l{ x /£}.
Together, \i and q define the measure p on 5 4 :
4
p(x) = e n y( xt i
6
Associate to (£> 4 ,/i) the "hidden" process (Xi)i and to (S 4 ,p) the "observed" process {Xi)\.
A straightforward numerical computation (whose explicit steps are given in the proof of The-
orem |OJ shows that the values of /i can be chosen so that %4p0 > 0.06 while fj24,(X) is
arbitrarily small.
Thus one cannot, in general, bound fjij(X) by cfjij(X) for some universal constant c; we were
rather fortunate to be able to do so in the hidden Markov case.
xV
0000
o
000000
0001
o
000000
0010
o
288413
0011
000000
0100
000000
0101
000000
0110
176290
0111
000000
1000
000000
1001
010514
1010
000000
1011
139447
1100
000000
1101
024783
1110
000000
1111
360553
Figure 1: The numerical values of /x on <S 4
Acknowledgements
I thank John Lafferty, Steven J. Miller and Kavita Ramanan for helpful comments on the draft,
and useful discussions.
References
[1] Leonid Kontorovich and Kavita Ramanan, "Concentration Inequalities for Dependent Ran-
dom Variables via the Martingale Method." http://arxiv.org/abs/math.PR/0609835
2006.
[2] Leonid Kontorovich, "Metric and Mixing Sufficient Conditions for Concentration of Mea-
sure." Paper in preparation, 2006.
[3] Michel Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and
Monographs Vol. 89, American Mathematical Society, 2001.
[4] Katalin Marton, "Bounding d-distance by informational divergence: a method to prove
measure concentration." Ann. Probab., Vol. 24, No. 2, 857-866, 1996.
[5] Katalin Marton, "A measure concentration inequality for contracting Markov chains."
Geom. Funct. Anal, Vol. 6, 556-571, 1997.
7
[6] Paul-Marie Samson, "Concentration of measure inequalities for Markov chains and $-mixing
processes." Ann. Probab., Vol. 28, No. 1, 416-461, 2000.
8